PCORI Methods contract ME-2018C1-11287 was awarded to co-principal investigators Toan Ong, PhD, and Michael Kahn, MD, PhD, at the University of Colorado Anschutz Medical Campus (CU-AMC) to implement and study the use of incremental record linkage techniques for both clear text record linkage (CTRL) and privacy preserving record linkage (PPRL). Using data sets obtained from the Colorado Congenital Heart Disease Registry (COCHD; PI: Teressa Crume, PhD), Children's Hospital Colorado (CHCO), and UCHealth, this body of work compared record linkage performance using standard (bulk) linkage (CTRL, PPRL) and incremental linkage (iCTRL, iPPRL). The objective was to determine whether incremental record linkage has linkage accuracy similar to bulk linkage. Because incremental linkage moves much less data between a data provider and a linkage Honest Broker, incremental methods are to be preferred if performance is equivalent. More information about this body of work can be found on the project's public-facing GitHub site. Technical details are posted in the project's GitHub wiki.
This Jupyter Lab notebook embodies the analytics used to explore Aim 4 of the above PCORI Methods contract.
Aim 4: Calculate and compare data quality (DQ) measures of completeness, density, and plausibility in unlinked and linked data using temporally partitioned COCHD data sets created in Aim 4.
This notebook uses a data set containing full protected health information (PHI) as defined by the Department of Health and Human Services HIPAA regulations. Thus, the underlying data sets cannot be made available, and this notebook runs only within the secure EUREKA analytics environment maintained by Health Data Compass in the Colorado Center for Personalized Medicine at CU-AMC. However, the logic for analyzing data quality before and after record linkage is generic. A sample data set based on synthetic data will be added to this notebook at a future date.
# Select which record linkage method to be analyzed
ctrl = ['job_28433','ctrl']
ictrl = ['job_28798','ictrl']
ipprl = ['job_26137','ipprl']
pprl = ['job_27970','pprl']
rl_type = ipprl
if rl_type[1] in ['ctrl','pprl']:
    incremental = False
else:
    incremental = True
The core DQ idea is to compare DQ measures using unlinked rows versus linked rows.
This work defines four data quality measures that can be calculated for both linked and unlinked data:
Many other DQ concepts exist, but not all have equivalent computational analogues in both unlinked and linked data. The most difficult issue in creating DQ measures that are comparable between unlinked and linked data is determining the correct denominator to use across measures and runs. Each DQ measure defined here has its denominator described in the Python function that calculates its values.
ALL ANALYSES ARE PERFORMED ON THE COHORT OF PATIENTS THAT PARTICIPATED IN AT LEAST ONE LINKAGE. PATIENTS THAT DID NOT LINK ("SINGLETONS") ARE REMOVED FROM THESE CALCULATIONS.
Justification: This study examines how record linkage alters DQ measures, so patients who never link are not its focus. Also, because the number of linkages is much smaller than the number of non-linkages, removing the non-linked patients allows DQ changes to be seen.
TECHNICAL NOTE: The network_id associated with a UID may change across runs. For the current run, the most recently assigned network_id should therefore be used for each UID. To find it, we need the last run_id that has an assigned network_id for each UID; the network_id assigned in that run is the UID's network_id for the current run. Thus, we query for max(run_id) grouped by UID and use the result to assign max_nid.
EXAMPLE: UID 10 is assigned NID=1111 in Run 1, NID=3333 in Run 2, and NID=6666 in the final run. The network_id used for UID 10 is therefore 6666.
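The selection rule in the technical note can be sketched in plain Python. This is a minimal illustration on toy data, not the notebook's SQL: the tuple layout `(uid, run_id, network_id)` and the helper name `latest_network_ids` are assumptions made for this sketch. UID 10 follows the example above, with run 3 standing in for the final run; UID 20 is invented.

```python
# Toy assignment history as (uid, run_id, network_id) tuples (hypothetical values).
assignments = [
    (10, 1, 1111),
    (10, 2, 3333),
    (10, 3, 6666),  # run 3 stands in for "the final run"
    (20, 1, 2222),
    (20, 2, 4444),  # UID 20 receives no assignment after run 2
]

def latest_network_ids(rows):
    """Return {uid: network_id}, taken from each UID's highest run_id."""
    latest = {}  # uid -> (run_id, network_id) of the most recent assignment seen
    for uid, run_id, network_id in rows:
        if uid not in latest or run_id > latest[uid][0]:
            latest[uid] = (run_id, network_id)
    return {uid: nid for uid, (_, nid) in latest.items()}

max_nid = latest_network_ids(assignments)
print(max_nid)  # {10: 6666, 20: 4444}
```

UID 20 shows why the rule keys on the last run that has an assignment rather than on the current run number: its most recent network_id comes from run 2.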
#Imports
import os
from dotenv import load_dotenv
import pandas as pd
from sqlalchemy import create_engine, text
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.gridspec as gridspec
import seaborn as sns
from datetime import datetime
#Globals
# Graphics globals
plt.style.use('classic')
sns.set_context('paper')
sns.set_style("whitegrid")
sns.set(font_scale=1)
#
%matplotlib inline
%load_ext sql
# Local execution
# Environment variables
# Local .env file only has one variable named DOTENV that is a full path to the real environment variables
load_dotenv()
real_dotenv=os.getenv('DOTENV')
load_dotenv(real_dotenv)
#Debugging options here: postgres.......
#debug=('postgres')
debug=('postgres')
print("Run DT: ",datetime.now())
TODO: Create a picture of this function
def global_linkage_stats():
    # Returns the SQL that summarizes, per run, how many network IDs (NIDs) and
    # study IDs (SIDs) fall into each linked-set size (cardinality).
    query = """
    with nid_sid as (
        select distinct ni.run_id, ni.network_id as nid, rp.study_id as sid
        from aim4.network_id ni join aim4.merged_source ms on ni.uid=ms.uid
        join aim4.raw_person rp on ms.id::int = rp.study_id
    )
    , sid_groups as (
        -- counts of studyids per NID; window "partition by" needed to keep studyid
        select run_id, nid, sid, count(sid) over (partition by run_id,nid) as cardinality
        from nid_sid
    )
    -- Filter out STUDYIDs that do not participate in at least one record linkage (count(studyid) > 1 per NID)
    , linked_sids as (
        select run_id, nid, sid, cardinality
        from sid_groups
        where cardinality > 1
    )
    , nid_sid_counts as (
        select run_id, cardinality, count(distinct nid) as n_nid, count(distinct sid) as n_sid
        from linked_sids
        where cardinality > 1
        group by run_id, cardinality
        order by run_id asc, cardinality asc, count(nid) asc
    )
    , nid_studyid_totals as (
        select run_id, cardinality, n_nid, n_sid
        , sum(n_nid*cardinality) over (partition by run_id) as sid_total
        , sum(n_nid) over (partition by run_id) as nid_total
        from nid_sid_counts
    )
    , nid_sid_pct as (
        select run_id, cardinality, n_nid, n_sid, sid_total, nid_total, n_nid/nid_total as pct_nid, n_sid/sid_total as pct_sid
        from nid_studyid_totals
        order by run_id asc, cardinality asc
    )
    select run_id, cardinality, n_nid, n_sid, sid_total, nid_total, pct_nid, pct_sid
    from nid_sid_pct
    """
    return query
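To make the intent of this query easier to follow, the same summary can be sketched in plain Python for a single run. This is an illustration on toy data, not a replacement for the SQL: the helper name `linkage_stats` and the sample `(nid, sid)` pairs are hypothetical, and the sketch assumes each study ID belongs to exactly one network ID within a run, so that `n_sid` equals `cardinality * n_nid`.

```python
from collections import Counter

# Toy (nid, sid) pairs for one run (hypothetical values).
nid_sid = [
    (1, "A"), (1, "B"),            # NID 1 links 2 study IDs
    (2, "C"), (2, "D"),            # NID 2 links 2 study IDs
    (3, "E"), (3, "F"), (3, "G"),  # NID 3 links 3 study IDs
    (4, "H"),                      # singleton, excluded from the summary
]

def linkage_stats(pairs):
    """Summarize linked sets by cardinality, ignoring singletons."""
    # Distinct pairs -> number of study IDs per NID (the cardinality of each NID).
    sids_per_nid = Counter(nid for nid, _ in set(pairs))
    # Keep only linked sets (cardinality > 1) and count NIDs per cardinality.
    n_nid = Counter(c for c in sids_per_nid.values() if c > 1)
    nid_total = sum(n_nid.values())
    sid_total = sum(card * n for card, n in n_nid.items())
    return [
        {
            "cardinality": card,
            "n_nid": n,
            "n_sid": card * n,  # assumes one NID per study ID within a run
            "pct_nid": n / nid_total,
            "pct_sid": card * n / sid_total,
        }
        for card, n in sorted(n_nid.items())
    ]

stats = linkage_stats(nid_sid)
for row in stats:
    print(row)
```

With this toy input, two of the three linked sets have cardinality 2 and one has cardinality 3, so the pairs account for 4 and 3 of the 7 linked study IDs respectively.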
Completeness calculates the presence/absence of a data value without regard to its value. For unlinked rows, either a value is present or it is not. For linked data, a value in at least one linked member of the linked set is sufficient to say that a value is present for that link.
Completeness is reported as a percent: # with value present / Total #. The numerators & denominators are different for unlinked and linked:
NOTE: This metric does not check whether multiple values within a linked set agree. That feature is examined in value density.
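The difference between unlinked and linked completeness can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's calculation: it assumes the unlinked denominator is the number of rows and the linked denominator is the number of linked sets (network IDs). The function names and toy rows are hypothetical.

```python
# Toy rows as (network_id, value); None marks a missing value (hypothetical data).
rows = [
    (1, "1990-01-01"), (1, None),                     # set 1: one member has a value
    (2, None), (2, None),                             # set 2: no member has a value
    (3, "1985-06-30"), (3, "1985-06-30"), (3, None),  # set 3: two members have a value
]

def unlinked_completeness(rows):
    """Percent of individual rows with a value present (assumed denominator: rows)."""
    present = sum(1 for _, value in rows if value is not None)
    return 100.0 * present / len(rows)

def linked_completeness(rows):
    """Percent of linked sets where at least one member has a value
    (assumed denominator: number of linked sets)."""
    has_value = {}  # network_id -> True once any member contributes a value
    for nid, value in rows:
        has_value[nid] = has_value.get(nid, False) or value is not None
    return 100.0 * sum(has_value.values()) / len(has_value)

unlinked_pct = unlinked_completeness(rows)
linked_pct = linked_completeness(rows)
print(round(unlinked_pct, 1), round(linked_pct, 1))
```

Here 3 of 7 rows carry a value, while 2 of 3 linked sets have at least one member with a value, so linkage raises completeness in this toy case. Set 3 counts once even though two members agree, which is the behavior the note above describes.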